6 VITAL Rules for Production-Ready Copilot Agents

recording
ai-agents
prompt-engineering
Professional prompt engineering techniques for deploying AI agents in production environments
Author

Mario Fontana

Published

January 11, 2026

Session Summary: 6 VITAL Rules for Production-Ready Copilot Agents

Recording Date: 2026-01-11
Summary Date: 2026-01-11
Summarized By: Dario Airoldi
Recording Link: YouTube Podcast
Duration: ~48 minutes
Speaker: Mario Fontana (Microsoft)

Session Title: 6 VITAL Rules for Production-Ready Copilot Agents

Executive Summary

This masterclass challenges the misconception that prompt engineering is just “writing well”.

Mario Fontana shares six fundamental rules for designing production-grade AI agents, emphasizing that creating reliable agent behavior is pure behavioral engineering.

The session focuses on practical techniques used in real production projects to control hallucinations and ensure agents execute tasks without improvisation, distinguishing between demo-quality toys and professional production agents.

Table of Contents

  • [00:00] The Uncomfortable Truth About AI Hallucinations
  • [04:56] The Invisible Line: Chat Prompts vs Agent Prompts
  • [11:36] Rule 1: Treat System Prompts as Production Software
  • [20:41] Rule 2: Beyond “Think Step by Step” - Structured Reasoning
  • [27:47] Rule 3: Ruthless Reduction - The 10-20% Token Target
  • [33:27] Rule 4: Automate Reviews with Agent Reviewers
  • [39:41] Rule 5: Separate Behavior from Context
  • [44:20] Rule 6: Model-Specific Prompt Optimization
  • [47:56] Recap and Looking Forward

[00:00] The Uncomfortable Truth About AI Hallucinations

Key Insight: LLMs will always hallucinate - it’s a fundamental mathematical limitation, not a bug to be fixed.

Mario opens with a stark production scenario: an AI agent invents a 40% discount for a new customer contract. The customer has already signed, and there are no error logs - just perfectly constructed, perfectly plausible, perfectly wrong responses.

Core Problem:

  • LLMs are optimized for statistical probability of the next word, not truth
  • They maximize “truthfulness appearance,” not correctness
  • When they don’t find the answer, they improvise and hallucinate

The Silver Lining: When you have properly designed architecture, you can reduce hallucination frequency. You can’t eliminate them, but you can make them measurable, manageable, and engineerable.

Production Reality: The answer isn’t “change models,” “add more data,” or “hope it goes better.” The answer is treating your system prompt as true application infrastructure.


[04:56] The Invisible Line: Chat Prompts vs Agent Prompts

The Critical Distinction: Most failures occur because developers don’t see the invisible line separating chat prompts from production agent prompts. Crossing this line changes everything.

Example of Dangerous Prompt:

Respond based on provided documents.
If the question doesn't concern documents, don't respond.
Be precise. Don't hallucinate.

Why This Fails in Production:

  • Doesn’t specify what counts as a source
  • Doesn’t define what to do when something is missing
  • Doesn’t specify what’s allowed
  • Doesn’t explain what to do when sources contradict

Real-World Scenario: When asked “What’s the maximum discount for a new customer?”, the agent:

  1. Searches documents
  2. Finds “15% renewal discount”
  3. Finds nothing about new customer discounts
  4. Must choose: say “I don’t know” (seems unhelpful) or fill the gap by improvising
  5. Generates plausible answer (e.g., 40%) because it sounds truthful

The Paradigm Shift:

  • Chat Prompt: Temporary, single-moment question
  • System Prompt: Permanent production configuration
  • System prompts are production code written in natural language.
  • Production code must be tested, or it will likely explode

Framework Differences:

  • Chat Frameworks: CO-STAR (great for conversations)
  • Agent Frameworks: RTF (Role-Task-Format), guardrail-oriented patterns aligned to how agents process structure

What Professional System Prompts Must Include:

  1. Role & Identity: Clear definition of agent capabilities
  2. Document Scope: What sources are considered authoritative
  3. Response Management: How to handle different question types
    • Ambiguous/complex questions
    • Out-of-context questions
    • Partially correlated questions
  4. Data Gaps: How to handle missing information
  5. Conflict Resolution: What to do when sources contradict
  6. Privacy & Confidentiality: Especially for corporate environments
  7. Quality Control: What to do/not do, what to write/not write
  8. Error Handling: Clear boundaries and how to manage them

Professional Response Example:

> “I haven’t found the specific policy for new customers. I found 15% for renewals. For undocumented cases, contact the commercial office.”

Why This Works: The agent doesn’t invent or improvise because it has rules that eliminate the need to improvise.
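
To make these eight sections concrete, here is a minimal skeleton showing how such a system prompt could be organized. The wording is illustrative only (it echoes the discount example above), not a template from the session:

Role & Identity: You are a contracts assistant. Answer only from the sources listed below.
Document Scope: Authoritative sources are the Legal repository and the published pricing policy. Nothing else counts as a source.
Response Management:
- Ambiguous question → ask one clarifying question first.
- Out-of-scope question → decline and state your scope.
- Partially related question → answer only the in-scope part and say so.
Data Gaps: If no source covers the question, state what you did NOT find and name the closest documented alternative.
Conflict Resolution: If two sources disagree, cite both and escalate; never silently pick one.
Privacy & Confidentiality: Never include personal data or unpublished pricing in a response.
Quality Control: Cite the source document for every figure you state.
Error Handling: For undocumented cases, direct the user to the commercial office.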


[11:36] Rule 1: Treat System Prompts as Production Software

Core Principle: System prompts are not simple text files - they are essential production components of your solution.

The Promptware Engineering Concept: A recent paper, “Promptware Engineering: Software Engineering for LLM Prompt Development,” formalizes this idea:

  • Prompts are code written as software
  • They must be treated with software engineering discipline

Common Anti-Pattern:

system_prompt_v3_final_REALLY_final_use_this_one.txt

This file name is a symptom that you’re managing your agent’s brain with Word-document logic.

Version Control Requirements: System prompts must live in a repository with:

  • Versions: Clear version tags (e.g., v1.2.3)
  • Reviews: Code review process before deployment
  • Traceability: Answer four critical questions when issues occur:
    1. When did behavior change?
    2. Who changed what?
    3. Which tests were passed?
    4. Why wasn’t it caught?

What to Version Together: Everything that defines behavior:

  • Agent instructions
  • Action references
  • Policy on when/what agent can call
  • Fallback rules
  • Response format
  • Starter prompts

Why Version Together: In Copilot Studio, behavior is the interaction of 5 tightly correlated variables:

  1. Textual instructions
  2. Tools/Actions
  3. Data in knowledge base
  4. Scope limits
  5. Response format

If you change one without the other, you’ve modified the system without knowing it.
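
One way to honor “version together” is to pin all five variables in a single versioned artifact. A minimal sketch in Python, assuming a hypothetical release manifest (this is an illustration, not a Copilot Studio feature):

# agent_release.py — hypothetical manifest that versions all five variables together
AGENT_RELEASE = {
    "version": "1.2.3",                                     # clear version tag
    "instructions": "prompts/system_prompt.md",             # textual instructions
    "actions": ["search_legal_repo", "create_ticket"],      # tools/actions the agent may call
    "knowledge": ["legal-repo@2026-01-snapshot"],           # knowledge base data
    "scope": {"allowed_topics": ["contracts", "pricing"]},  # scope limits
    "response_format": "cited-answer-v1",                   # response format
}

A change to any one field is then visibly a change to the release as a whole.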

Testing Strategy: Don’t test with “I tried three prompts manually.” Use structured test cases (a runnable sketch follows the lists below):

Recommended Test Set (20 cases):

  • 5 Easy Cases: Everything should work perfectly
  • 5 Ambiguous Cases: Agent should stop or ask for clarification
  • 5 Out-of-Scope Cases: Agent should refuse correctly
  • 5 Plausible Traps: Seems right but wrong - test hallucination prevention

Evaluation Criteria (Behavioral, not accuracy):

  • ✓ Cited sources correctly
  • ✓ Used “uncertain” language when data missing
  • ✓ Avoided invented numbers
  • ✓ Followed imposed format
  • ✓ Called tools only when allowed

Integration with Analytics: Track KPIs you can read in Analytics:

  • Errors per scenario
  • Hallucination rates
  • Source citation compliance
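
A minimal sketch of how the 20-case set and the behavioral criteria above could run as an automated suite. Everything named here is hypothetical: ask_agent stands in for however you call your deployed agent, and the “[source:” marker assumes you impose a citation format:

import re

def ask_agent(question: str) -> str:
    """Stub: replace with the actual call to your deployed agent."""
    return "I haven't found that policy. Contact the commercial office."

# Hypothetical cases following the 5/5/5/5 split above; only two are shown.
TEST_CASES = [
    {"q": "What is the renewal discount?",
     "expect_source": True},
    {"q": "What discount can I give a brand-new customer?",  # plausible trap
     "expect_uncertainty": True,
     "forbidden": ["40%", "25%"]},  # plausible-but-invented values to catch
]

HEDGES = re.compile(r"haven't found|not documented|contact the", re.I)

def evaluate(case: dict, answer: str) -> dict:
    """Behavioral checks (per the criteria above), not accuracy checks."""
    checks = {}
    if case.get("expect_source"):
        checks["cited_source"] = "[source:" in answer   # assumed citation marker
    if case.get("expect_uncertainty"):
        checks["used_uncertain_language"] = bool(HEDGES.search(answer))
    if case.get("forbidden"):
        checks["avoided_invented_numbers"] = not any(v in answer for v in case["forbidden"])
    return checks

for case in TEST_CASES:
    print(case["q"], "->", evaluate(case, ask_agent(case["q"])))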

Deployment Patterns:

1. Canary Deployment (5-25-100 pattern):

  • Deploy to 5% of users first
  • If successful, expand to 25%
  • Finally roll out to 100%
  • In Copilot Studio: Use Teams channels with specific security groups

2. Ring Deployment (inside-out expansion):

  • Ring 0 (Internal): Development team
  • Ring 1 (Beta): Test users
  • Ring 2 (Production): All users
  • Native support via Power Platform environments and pipelines

3. Feature Flags:

  • Turn off new functionality with a click (no redeployment needed)
  • Soft rollback capability
  • If agent starts hallucinating, change environment variable
  • All agents stop using new logic within minutes
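
A minimal sketch of the feature-flag idea in Python: gate the new prompt behind an environment variable so rollback is a configuration change, not a redeployment. The variable and file names are hypothetical:

import os

# Hypothetical flag: flip USE_PROMPT_V2 to "false" in the environment and
# every agent instance falls back to the known-good prompt within minutes.
def load_system_prompt() -> str:
    if os.getenv("USE_PROMPT_V2", "false").lower() == "true":
        path = "prompts/system_prompt_v2.md"    # new logic under trial
    else:
        path = "prompts/system_prompt_v1.md"    # known-good fallback
    with open(path, encoding="utf-8") as f:
        return f.read()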

Production Safeguards:

  • Automated test suite runs on every deployment
  • CI/CD pipelines with approval gates
  • Git with ready rollback (no stress)
  • Analytics showing where agents are failing
  • Test kit integrated into every pipeline release
  • Control System: Central dashboard for administrators to see all agents, usage, and policy compliance

Bottom Line: A system prompt modification is a release, not a quick edit. Treat it as such.


[20:41] Rule 2: Beyond “Think Step by Step” - Structured Reasoning

The Problem with “Think Step by Step”: This phrase has become a mantra, appearing everywhere in prompts, articles, tutorials, and social media. But in production, this phrase alone can actually be a problem.

Why Generic “Think Step by Step” Fails:

Example Dangerous Prompt:

Respond to user questions about [topic].
Think step by step before responding.

This might help in simple chat scenarios, but it’s not a solution or control mechanism for production agents.

The Real Issue:

  • The point isn’t to think more
  • The point is to think the right steps
  • If the model doesn’t know which steps to take, it produces beautiful reasoning that leads to wrong answers

Real-World Example: Question: “Does a part-time employee transferred from Italy to Germany mid-fiscal year have the right to a Christmas bonus?”

What Happens with Generic Prompting: The agent produces a generic checklist that seems reasoned, but the answer is likely wrong because:

  • This case has many country-specific exceptions
  • Mid-year transfers have special rules
  • The model has no concrete procedure to follow
  • It thought “step by step,” but didn’t know which steps to take
  • It based reasoning on general knowledge, which is insufficient for HR/legal contexts

The Solution: ReAct Pattern (Reason + Act)

ReAct = Reason and Act (not just “think better”):

  • Think: Analyze what’s needed
  • Act: Consult sources, use tools, retrieve policies
  • Observe: What did you find?
  • Think: Reason again based on observations
  • Loop: Continue until complete

Why ReAct Works: It extends reasoning with environmental interaction, making the model more reliable for tasks requiring external data. This is a true agent pattern.

Implementation Strategy: Don’t use ReAct as a phrase to paste. Make it effective through:

  1. Don’t give permission to reason generally
  2. Teach how to reason in specific scenarios
  3. Provide curated examples with domain precedents
  4. Give operational checklists for the domain
  5. Specify the sequence of moves for different scenarios

Professional ReAct Example (HR/Legal Domain):

Prompt Structure:

Think: Which jurisdictions are involved?
Action: Extract mentioned countries (Italy, Germany)
Observation: Identified Italy and Germany as relevant countries
Think: Check if policy exists for mid-year transfers
Action: Search transfer policy database
Observation: No policy found for mid-year transfers
Final Response: "I haven't found the policy for mid-year transfers. 
I found policies for single-nation scenarios. 
I recommend escalation to HR for this undocumented case."

Key Difference:

  • Doesn’t reason generically
  • Knows what to do: identify jurisdictions → retrieve policies → signal gaps
  • Doesn’t invent because it has a procedure that leaves no room for improvisation

How to Implement:

Step 1: Identify the 3-5 most critical/ambiguous scenarios

Step 2: For each, write a complete example:

  • Question type
  • Decision checklist
  • Tools to use
  • Sources to consult
  • Final response type

Step 3: Put these examples where the model can always see them, so it uses them as operational precedents

The Real Value: Not in the magic phrase, but in:

  • The examples you provide
  • The steps you teach
  • The Think-Act-Observe loop you design (sketched in code below)
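
For developers who think in code, the same loop can be sketched as control flow. The tool names are hypothetical stand-ins for your agent's real actions:

# Hypothetical tool stubs — replace with your agent's real actions.
def extract_countries(question: str) -> list[str]: ...
def search_policies(countries, topic: str) -> list: ...
def summarize_with_citations(policies) -> str: ...

def answer_transfer_question(question: str) -> str:
    # Think: which jurisdictions are involved?
    countries = extract_countries(question)                           # Act
    # Observe what was found, then Think again
    policies = search_policies(countries, topic="mid-year transfer")  # Act
    if not policies:
        # The procedure leaves no room to improvise: signal the gap, escalate.
        return ("I haven't found the policy for mid-year transfers. "
                "I recommend escalation to HR for this undocumented case.")
    return summarize_with_citations(policies)  # grounded answer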

[27:47] Rule 3: Ruthless Reduction - The 10-20% Token Target

The Paradox: The more examples you add, the more precise you try to be, the more your system prompt grows. But if it grows too much, it doesn’t become more intelligent - it becomes more contradictory.

What Happens with Contradictory Prompts: When a scope is contradictory, the agent doesn’t warn you. It decides probabilistically which rule wins each time.

Common Contradiction Example:

  • Line 15: “Respond only on contractual clauses in Legal repository”
  • Line 167: “For workflow questions, use documents in HR repository”
  • Line 245: “Never respond on topics outside Legal repository”

The Question: Is scope legal-only, or also HR, or never outside Legal? Who decides? Not you, because you don’t see these contradictions. The model decides.

Rule 3 Essence: After the initial draft, perform a reduction pass. Target: 10-20% shorter without losing critical information.

Sprint-Style Target: Treat each refactor as a sprint goal: cut 10% of tokens without losing clear behaviors (a measurement sketch follows the reduction checklist below).

Benefits of Reduction:

  1. More Coherence: Fewer contradictions = fewer hallucinations
  2. More Predictability: Stable behavior
  3. Lower Costs: Real token savings

Reduction Iteration Process:

1. Eliminate Redundancies:

  • Same concept repeated in multiple sections → keep one, remove others

  • Example: Instead of writing “technical documentation repository” 15 times:

    # Define once at top
    TechRepo = Technical Documentation Repository
    
    # Use everywhere else
    Search TechRepo for...
  • Result: 20% fewer tokens, same clarity

2. Resolve Contradictions:

  • When you find “only X” then “also Y” → choose one unambiguously
  • Don’t leave the model to decide
  • If policy is ambiguous to you, it will be ambiguous to the model
  • Clarify before deployment

3. Optimize Examples:

  • Don’t show 5 complete identical examples
  • Show the pattern once (Question → Response)
  • One fully curated complete example
  • Additional variations only if they differ meaningfully (Scenario B, Scenario C)
  • Result: Same value, half the space, more readable

4. Remove Filler Words:

  • Cut: “please,” “kindly,” “if possible,” “try to”
  • The model doesn’t need politeness
  • Every useless word is a wasted token
  • System prompts need precision, not courtesy
  • Put courtesy in the user interface, not the system prompt

Test After Each Reduction: Run key scenarios:

  • Behavior unchanged? ✓ OK
  • Behavior changed? → Evaluate and adjust
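
To make the 10-20% target measurable rather than a feeling, count tokens before and after each pass. A sketch using tiktoken's cl100k_base encoding as a rough proxy (the real tokenizer depends on your model, and the file names are hypothetical):

import tiktoken

# cl100k_base as a proxy; use your model's actual encoding in practice.
enc = tiktoken.get_encoding("cl100k_base")

def token_count(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

before = token_count("prompts/system_prompt_v1.md")
after = token_count("prompts/system_prompt_v2.md")
saved = 1 - after / before
print(f"{before} -> {after} tokens ({saved:.0%} saved)")
assert saved >= 0.10, "Sprint goal: at least 10% fewer tokens"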

Real-World Example: Legal prompt reduced from 300 lines to 150-180 lines:

  • Zero redundancy
  • Zero contradictions
  • Same capability
  • Much more predictable behavior
  • 30% fewer tokens per interaction

Benefits Summary:

  • Better coherence
  • Fewer hallucinations
  • Lower costs (shows up in compliance reports)

Advanced Techniques:

  • Prompt Compression: Frameworks like LangChain compression (aggressive)
  • Other specialized compression tools
  • (Note: Mario offers to create dedicated episode if there’s interest)

Practical Philosophy: Precision and iteration, just like code. Every word must earn its place.


[33:27] Rule 4: Automate Reviews with Agent Reviewers

The Problem with Manual Review:

Three Structural Problems:

  1. Time: A system prompt with hundreds of lines requires 2-3 hours of genuine attention
  2. Focus Loss: You start losing focus, and things slip through
  3. Scalability: Reviews become half-day tasks, so you start skipping them

The Danger: When you skip review, you’re back to the invented 40% discount scenario.

The Solution: Agent Reviewer

Concept: Create a dedicated agent with a single purpose: criticize and improve your agent instructions

This isn’t a tool - it’s a proper reviewer agent that looks at the same things every time without getting tired:

  • Redundancies
  • Contradictions
  • Ambiguous scope
  • Missing fallbacks
  • Overly generic examples
  • Untested edge cases

Output Structure:

  1. Structured Analysis: Itemized list of problems found
  2. Proposed Version: Improved, more robust version

Time Comparison:

  • Human Time: 1-3 hours
  • Agent Time: 30 seconds
  • Completeness: Finds ALL issues its control system can detect, not just some

This Isn’t a Shortcut: It’s a pattern. You’re using one model to improve the quality of another model’s instructions.

The Real Point: Not to have AI write everything, but to have:

  • Constant review: Every time, every prompt
  • Repeatable review: Same quality checks
  • Scalable review: Works on every modification, every agent
  • No human dependency: Doesn’t depend on someone’s attention span

Building the Reviewer Agent:

Think Like a Professional: The reviewer must enforce discipline:

  • What to Check: Specific criteria
  • How to Classify Errors: Severity levels
  • Where Errors Found: Line numbers, sections
  • Production Impact: What could go wrong
  • Concrete Correction: Specific fix proposal

Critical Requirement - Structured Output:

  • Must be testable
  • Must be comparable across versions
  • Enables batch testing
  • Integrates into pipelines
  • Can be used in future automated workflows

Uncertainty Handling: If reviewer isn’t sure about a problem, it must explicitly signal uncertainty
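
Putting these requirements together, a hedged sketch of what the reviewer's instructions could look like. The checklist, severity levels, and JSON fields are illustrative, not a Microsoft template:

You are a prompt reviewer. Your only job is to criticize and improve
the system prompt you are given.
Check for: redundancies, contradictions, ambiguous scope, missing
fallbacks, overly generic examples, untested edge cases.
Report every finding as a JSON object:
  { "check": "...", "severity": "critical | major | minor",
    "location": "section or line", "production_impact": "...",
    "proposed_fix": "...", "confidence": "certain | uncertain" }
If you are not sure a finding is real, set "confidence": "uncertain".
End with a complete proposed revision of the prompt.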

Where to Run: Anywhere convenient - just make the decision to do it. The advantage for future agents will be enormous.

Recommended Workflow:

  1. Write system prompt in rough version
  2. Pass to reviewer for analysis
  3. Review output - see what it proposed
  4. Test the recommendations
  5. Evaluate results
  6. Re-pass to reviewer until fully satisfied
  7. Deploy when no critical issues remain

This is a Quality Gate - same logic as software quality gates.

Continuous Improvement: When you discover a new type of error in production:

  • Add it to reviewer’s control checklist
  • Reviewer evolves with you
  • Becomes part of your control platform

Similar to Copilot Control System: Just as Control System is for administrators to monitor agents, the reviewer agent is for developers to control prompt quality.

Available Resources: Microsoft offers prompt coach agents to improve daily requests, or use as template for custom reviewers.

Connection to Rule 1: Remember: No more prompt_reviewer_final_v7_this_time_for_real.txt

  • It’s code
  • Treat it as such
  • Always version controlled

Rule 4 Summary: Automate review in 30 seconds, get a report, stop going by instinct.


[39:41] Rule 5: Separate Behavior from Context

The Core Problem: Even if your prompt is perfect, your agent can still hallucinate for a much more mundane reason: context.

Many system prompts fail not because of how they reason, but because of how they’re conditioned by their context.

The Mistake: If you put things in the system prompt that the architecture can already provide, you’re paying twice:

  • Once in complexity
  • Once in fragility

What Changed: Microsoft Graph Intelligence Layer

Announced: November 2025 at Ignite

This isn’t a feature - it’s an architectural layer between Graph and agents.
It fundamentally changes:

  • What you should put in agent instructions
  • Especially what you should NOT put there

Before Intelligence Layer: Many projects hardcoded organizational context in system prompts:

  • Procedures
  • Who approves what
  • Who owns what
  • Team structures

Why This Seemed Useful: It provides context… until it changes.

The Danger: When a person changes roles, or a team reorganizes, an outdated prompt becomes a potential hazard:

  • Generates plausible responses
  • But based on outdated information
  • No error thrown - just wrong guidance

The More Context You Hardcode:

  • More risk of contradictions
  • Harder to maintain coherence across versions
  • Brittleness increases

Intelligence Layer Architecture:

Three Components:

  1. Data (Microsoft Graph):
    • Relationships
    • Meetings, chats, work connections
    • How work is actually connected
  2. Memory:
    • Preferences
    • Habits
    • User-specific customization (“how you want it”)
  3. Knowledge:
    • Organizational structure
    • Actual interaction patterns
    • How the organization really works

Key Feature: Permission-Aware:

  • Security stays under control
  • Traceable at every step
  • Provides signals to the model
  • Model decides how to use them based on the prompt

The New Separation:

In System Prompt (Behavior):

  • What the agent does
  • How it reasons
  • How it responds when something is missing

From Intelligence Layer (Context):

  • Who the user is
  • What they’re working on
  • Relevant organizational relationships
  • Available resources

Example: Legal Agent

Before (Brittle Approach):

If user is in Legal Team → Business tone
If user works on Project X → Prioritize documents Y
...

Problem: You’re transforming the system prompt into a mini corporate directory that becomes outdated.

After (Rule 5 Approach):

Adapt tone based on user's ACTUAL role
Prioritize sources relevant to user's CURRENT work
If role not available OR no relevant sources:
  [Define fallback behavior]

Write the Rule, Not the List:

  • Don’t repeat organizational context that can be provided dynamically
  • Define behavior for missing context
  • Agent must know what to do when it “doesn’t know”

Benefits:

  • Simpler system prompt
  • Separates behavior from context
  • More maintainable
  • Self-documenting limitations

Critical Principle: Every time you manually write context into the system prompt, you’re building fragility into your system.


[44:20] Rule 6: Model-Specific Prompt Optimization

The Uncomfortable Reality: A “good generic prompt” doesn’t exist. There exists a good prompt for that specific model.

Why This Matters: The prompt that works perfectly on Model A won’t necessarily work the same on Model B. What works today can change tomorrow when you update the model version.

What Changes Between Model Versions:

  • Sensitivity to certain constraints
  • How ambiguity is handled
  • Response patterns
  • Token interpretation

Common Developer Reaction: “Wait, I didn’t change anything. How is this possible?”

The Analogy: It’s like changing your compiler and expecting identical behavior.

The Rule (Simple but Often Ignored):

Every time you change model or version:

  1. Read the official prompt guide for that specific model
  2. Wire the model change into your test pipeline, updated with the latest official guide (see the sketch below)

Why Official Documentation: Don’t read an article that summarizes characteristics.
Read the official guide because it contains:

  • What the model follows reliably
  • What changed from previous version
  • Model-specific behavior patterns
  • Recommended prompt structures

Without This Step: You’re making luck-based changes, not engineering-based changes.
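
A sketch of that test-pipeline gate: run the same behavioral suite (Rule 1) against both model versions and block the switch on regressions. The function and model names are hypothetical placeholders:

# Hypothetical gate: rerun the Rule 1 behavioral suite on both model
# versions and block the switch if anything regresses.
def run_suite(model: str) -> dict[str, bool]:
    """Return {case_id: passed} for the 20-case behavioral test set."""
    raise NotImplementedError  # call the agent configured with `model`

baseline = run_suite("model-a-2025-09")     # current version (name illustrative)
candidate = run_suite("model-a-2026-01")    # proposed new version

regressions = [c for c, ok in baseline.items() if ok and not candidate.get(c)]
if regressions:
    raise SystemExit(f"Model change blocked; regressions: {regressions}")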

Practical Implementation:

Option 1: Use Copilot Chat. Open Copilot Chat and ask:

> “How do I optimize this prompt for the new version?”

Option 2: Update Your Agent Reviewer. Better yet, update your agent prompt reviewer with model-specific knowledge.

Option 3: Model-Specific Reviewers. Create dedicated agent prompt reviewers for each model you use regularly.

Multi-Model Architecture in Copilot Studio

Historical Context:

  • Before: One agent = one model
  • May 2025: “Bring Your Own Model” - connect models supported by Studio
  • September 2025: Microsoft announced multi-model selection

Current State (January 2026):

  • Catalog offers 11,000+ models
  • Can only use chat-completion compatible models
  • Critical Capability: Each agent action can use a different model

Strategic Implication: You’re no longer forced to choose “the best model.” You can choose which model for which task.

Example Multi-Model Agent:

Architecture:

  • Main Agent Instruction: Uses Model A (e.g., GPT-4o)
  • Long Document Analysis Action: Uses Model B (e.g., Claude with 200K context)
  • Complex Reasoning Task: Uses Model C (e.g., o1-preview)

Result: Everything works perfectly… until you realize each model needs model-specific prompt optimization.

Anthropic Integration: Microsoft announced Anthropic Claude models joining the Copilot Studio lineup, expanding model selection options.


[47:56] Recap and Looking Forward

What We Covered Today - The First Six Rules:

These build the foundation - prompt engineering as architecture, not text; prompt engineering as mindset:

  1. Rule 1: Treat system prompts as production software
    • Version control, reviews, structured testing
  2. Rule 2: Move beyond “think step by step” to structured reasoning
    • Real examples, domain-specific checklists, ReAct patterns
  3. Rule 3: Ruthlessly reduce prompt length
    • Target 10-20% reduction to eliminate contradictions
  4. Rule 4: Automate reviews with dedicated agent reviewers
    • 30-second reviews, consistent quality checks
  5. Rule 5: Separate behavior from context
    • Use intelligence layer for dynamic context
  6. Rule 6: Optimize prompts for specific models
    • Model-specific guidance, test pipeline integration

But This Is Only the Beginning:

Even if you implement all six rules perfectly, the invented 40% discount can still happen. Why? Because prompt design is only the first point of failure.

The Second Point of Failure (More Subtle and Dangerous): It doesn’t concern what the agent thinks - it concerns where the agent gets its truth.

Coming in Episode 2 - Six Advanced Rules:

Topics include:

  • Explicit Grounding: Forcing agents to cite sources correctly
  • Guardrail Testing: The right way to test boundaries
  • Production Monitoring: Real-time hallucination detection
  • How to minimize risks through engineering, not magic

The Series:

  • Episode 1: 6 Vital Rules [This session]
  • Episode 2: 6 Advanced Rules [Coming soon]
  • Future episodes based on community interest

Core Message: You’ll never eliminate hallucinations, but you can engineer reliability by:

  • Finding problems before production
  • Making issues measurable and manageable
  • Building systems, not hoping for magic

Key Takeaways

  1. Hallucinations Are Fundamental: Mathematical impossibility theorem proves LLMs will always hallucinate - architect around this reality

  2. Chat ≠ Production: The critical distinction between temporary chat prompts and permanent production system prompts

  3. Prompts Are Code: Version control, testing, and deployment discipline are mandatory, not optional

  4. Structure Over Magic Phrases: “Think step by step” alone is dangerous - use structured ReAct patterns with domain examples

  5. Less Is More: Target 10-20% token reduction to eliminate contradictions and improve coherence

  6. Automate Quality: Agent reviewers provide consistent 30-second reviews vs. 2-3 hour manual processes

  7. Context Separation: Don’t hardcode organizational context that changes - use intelligence layer dynamically

  8. Model Specificity: Each model needs its own prompt optimization - no universal prompts exist

  9. Multi-Model Strategy: Different tasks can use different models - optimize each action appropriately

  10. Engineering Mindset: This is behavioral engineering, not creative writing - discipline and testing are essential



Next Episode: Six Advanced Rules covering explicit grounding, guardrail testing, and production monitoring strategies.